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Abstract 

<N 

5^ ■ In this paper we consider the use of probabilistic or random models within a classical trust- 

Q-ij region framework for optimization of deterministic smooth general nonlinear functions. Our 

method and setting differs from many stochastic optimization approaches in two principal 
ways. Firstly, we assume that the value of the function itself can be computed without 
noise, in other words, that the function is deterministic. Secondly, we use random models 
of higher quality than those produced by usual stochastic gradient methods. In particular, 
a first order model based on random approximation of the gradient is required to provide 
sufficient quality of approximation with probability greater than or equal to 1/2. This is in 
pH ■ contrast with stochastic gradient approaches, where the model is assumed to be "correct" 

only in expectation. 

As a result of this particular setting, we are able to prove convergence, with probability 
one, of a trust-region method which is almost identical to the classical method. Hence we 
show that a standard optimization framework can be used in cases when models are random 
and may or may not provide good approximations, as long as "good" models are more likely 
than "bad" models. Our results are based on the use of properties of martingales. Our 
motivation comes from using random sample sets and interpolation models in derivative-free 
optimization. However, our framework is general and can be applied with any source of 



r-o \ uncertainty in the model. We discuss various applications for our methods in the paper. 
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1 Introduction 

1.1 Motivation 

The focus of this paper is the analysis of a numerical scheme that utilizes randomized models 
to minimize deterministic functions. In particular, our motivation comes from algorithms for 
minimization of so-called black-box functions where values are computed, e.g., via simulations. 
For such problems, function evaluations are costly and derivatives are typically unavailable and 
cannot be approximated. Such is the setting of derivative- free optimization (DFO), of which 
the list of applications — including molecular geometry optimization, circuit design, groundwater 
community problems, medical image registration, dynamic pricing, and aircraft design (see the 
references in [15]) — is diverse and growing. Nevertheless, our framework is general and is not 
limited to the setting of derivative-free optimization. 

There is a variety of evidence supporting the claim that randomized models can yield both 
practical and theoretical benefits for deterministic optimization. A primary example is the recent 
success of stochastic gradient methods for solving large-scale machine learning problems. As 
another example, the randomized coordinate descent method for large-scale convex deterministic 
optimization proposed in [21] yields better complexity results than, e.g., cyclical coordinate 
descent. Most contemporary randomized methods generate random directions along which all 
that may be required is some minor level of descent in the objective /. The resulting methods 
may be very simple and enjoy low per-iteration complexity, but the practical performance of 
these approaches can be very poor. On the other hand, it was noted in [5] that the performance of 
stochastic gradient methods for large-scale machine learning improves substantially if the sample 
size is increased during the optimization process. Within direct search, the use of random positive 
bases has also been recently investigated [UEI with gains in performance and convergence theory 
for nonsmooth problems. This suggests that for a wide range of optimization problems, requiring 
a higher level of accuracy from a randomized model may lead to more efficient methods. Thus, 
our primary goal is to design randomized numerical methods that do not rely on producing 
descent directions "eventually", but provide accurate enough approximations so that in each 
iteration a sufficiently improving step is produced with high probability (in fact, probability 
greater than half is sufficient in our analysis). We incorporate these models into a trust-region 
framework so that the resulting algorithm is able to work well in practice. 

Our motivation originates with model-based DFO methodology (e.g., see [14^ [T5] ) where 
local models of / are built from function values sampled in the vicinity of a given iterate. To 
date, most algorithms of this type have relied on sample sets that are generated by the algorithm 
steps or added in a deterministic manner. A complex mechanism of sample set maintenance is 
necessary to ensure that the quality of the models is acceptable, while the expense of sampling 
the function values is not excessive. Various approaches have been developed for this mechanism, 
which achieve different trade-offs for the number of sample points required, the computational 
expense of the mechanism itself, and the quality of the models. One of the primary premises of 
this paper is the assumption that using random sample sets can yield new and better trade-offs. 
That is, randomized models can maintain a higher quality by using fewer sample points without 
complex maintenance of the sample set. One example of such a situation is described in [3], 
where linear or quadratic polynomial models are constructed from random sample sets. It is 
shown that one can build such models, meeting a Taylor type accuracy with high probability, 
using significantly less sample points than what is needed in the deterministic case, provided 



the function being modeled has sparse derivatives. 

The framework considered by us in the current paper is sufficiently broad to encompass any 
situation where the quality or accuracy of the trust-region models is random. In particular, such 
models can be built directly using some form of derivative information, as long as it is accurate 
with certain probability. 

1.2 Trust-region framework 

The trust-region method introduced and analyzed in this paper is rather simple. At each iter- 
ation one solves a trust-region subproblem, i.e., one minimizes the model within a trust-region 
ball. Note that one does not know whether the model is accurate or not. If the trust-region step 
yields a good decrease in the objective function relatively to the decrease in the model and the 
trust-region radius is sufficiently small relatively to the size of the model gradient, then the step 
is taken and the trust-region radius is possibly increased. Otherwise the step is rejected and the 
trust-region radius is decreased. We show that such a method always drives the trust-region 
radius to zero. 

Based on this property we show that, provided the (first order) accuracy of the model occurs 
with probability no smaller than 1/2, possibly conditioned to the prior iteration history, then the 
gradient of the objective function converges to zero with probability one. Our proof technique 
relies on building random processes from the random events defined by the models being or not 
accurate (conditioned to the past), and then making use of their submartingale-like properties. 
We extend the theory to the case when the models of sufficient second order accuracy occur with 
probability no smaller than 1/2. We show that a subsequence of the iterates drive a measure of 
second order stationarity to zero with probability one. However, to demonstrate the lim-type 
convergence to a second order stationary point we need additional assumptions on the models. 

1.3 Notation 

Several constants are used in this paper to bound various quantities. These constants are denoted 
by k with acronyms for the subscripts that are indicative of the quantities that they are meant 
to bound. We list their most used definitions here, for convenience. The actual meaning of the 
constants will become clear when each of them is introduced in the paper. 

Kfcd "fraction of Cauchy decrease" 

Kfod "fraction of optimal decrease" 

KL g "the Lipschitz constant of the gradient of the function' 

KLh "the Lipschitz constant of the Hessian of the function" 

kl t "the Lipschitz constant of the measure r of second order stationarity of the function" 

K e f "error in the function value" 

K eg "error in the gradient" 

K e h "error in the Hessian" 

K eT "error in the r measure" 

Kbhm "bound on the Hessian of the models" 

K bhf "bound on the Hessian of the function" 

This paper is organized as follows. In Section [2] we briefly describe existing methods for 
derivative-free optimization and provide an illustrative example to motivate the use of random 



models. In Section [3] we introduce the probabilistic models of the first order and the trust-region 
method based on such models. The convergence of the method to first order criticality points 
is proved in Section HI The second order case is addressed in Section [SJ Finally, in Section [UJ we 
describe various useful random models that satisfy the conditions needed for convergence results 
in Sections [3] and [SJ 

2 Methods of derivative-free optimization 

We consider in this paper the unconstrained optimization problem 

min fix), 

where the first (and second, in some cases) derivatives of the objective function f(x) are assumed 
to exist and be Lipschitz continuous. However, as it is considered in derivative-free optimization 
(DFO), explicit evaluation of these derivatives is assumed to be impossible. Derivative- free 
methods rely on sampling the objective function at one or more points at each iteration. Some 
sample to explore directions, others to build models. 

Directional methods. Direct-search methods of directional type were first developed using 
a single positive basis or a finite number of them (see the surveys [2UJ and [151 Chapter 8]). 
The basic versions of these methods, like coordinate or compass search, are inherently slow for 
problems of more than a few variables, not only because they are not able to use curvature infor- 
mation and rarely reuse sample points, but also because they rely on few directions. They were 
shown to be globally convergent for smooth problems [32] and had their worst case complexity 
measured by global rates [33] . 

Not restricting direct search to a finite number of positive bases was soon discovered to 
enhance practical performance. Approaches allowing for an infinite number of positive bases 
were proposed in [TJ[20], with results applicable to nonsmooth functions when the generation is 
dense in the unit sphere (see pjj IM]). 

On the other hand, randomized stochastic methods recently became a popular alternative 
to direct-search methods. These methods also sample the objective function along a certain 
direction, but instead of choosing a direction from a positive basis, these methods select direc- 
tions totally randomly. This often allows for faster convergence because "good" directions are 
occasionally observed. The random search approach introduced in [21] samples points from a 
Gaussian distribution. Convergence of an improved scheme was shown in [25]. In [23] . Nesterov 
recently presented several derivative-free random search schemes and provided bounds for their 
global rates. Different improvements of these methods emerged in the latest literature, e.g., |19j . 
Although complexity results for both convex and nonsmooth nonconvex functions are available 
for randomized search, the practical usefulness of these methods is limited by the fixed step 
sizes determined by the complexity analysis and, as in direct search, by the lack of curvature 
information. 

Model-based trust-region methods. Model-based DFO methods developed by Powell [261 
[271 ESI EH] , and by Conn, Scheinberg, and Toint [HUE], introduced a class of trust-region methods 
that relied on interpolation or regression based quadratic approximations of the objective func- 
tion instead of the usual Taylor series quadratic approximation. The regression-based method 





Figure 1: CS: num.eval.=11307, / = 1(T 6 , DSR: num.eval.=5756, / = 1CT 8 . 

was later successfully used in [3] based on [T3] . In all cases the models are built based on sample 
points in reasonable proximity to the current best iterate. The computational study of More 
and Wild [22] has shown that these methods are typically significantly superior in practical per- 
formance to the other existing approaches due to the use of models that effectively capture the 
local curvature of the objective function. While the model quality is undoubtedly essential for 
the performance of these methods, guaranteeing sufficient quality at all times is quite expensive 
computationally. Randomized models, on the other hand, can offer a suitable alternative by 
providing a good quality approximation with high probability. 



An illustration of directional and model-based methods. 

Rosenbrock function for our computational illustration 



Consider the well known 



f(x) = 100(x 2 



4) 2 + 



l-Xl) 



The function is known to be difficult for first order or zero order methods and well suited 
for second order methods. Nevertheless, some first/zero order methods perform reasonably, 
while others perform poorly. In Figures [TH2] we present the contours of the function and plot 
the iterates produced by four methods: 1) a simple variant of direct search, the coordinate or 
compass search method (CS) which uses the positive basis [/ —I], 2) a direct-search method using 
the positive basis [Q — Q] where Q is an orthogonal matrix obtained by randomly generating 
the first column (DSR), 3) a random search (RS) with step size inversely proportional to the 
iteration count, and 4) a basic model-based trust-region method with quadratic models (TRQ). 
The outcome of the algorithms is summarized in the caption, which lists the number of function 
evaluations and the final accuracy for each method. 

It is evident from these results that the random directional approaches, and in particular 
random search, are more successful at finding good directions for descent, while the coordinate 
search is slow due to the fixed choice of the search directions. It is also clear, from the perfor- 
mance of the second order trust-region method on this problem, that using accurate models can 
substantially improve efficiency. It is natural, thus, to consider the effects of randomization in 
model-based methods. In particular we consider methods that use models built from randomly 
sampled points in hopes of obtaining better models. 





Figure 2: RS: num.eval.=3724, / = 1(T 8 , TRQ: num.eval.=62, / = 10 
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3 First order trust-region method based on probabilistic models 

Let us consider the classical trust-region method setting and notation (see [15] for a similar 
description). At iteration k, f is approximated by a model m k within the ball B(xk, 5k) centered 
at x k and of radius 5k- Then the model is minimized (or approximately minimized) in the ball 
to possibly obtain Xk+i- In this section we will introduce and analyze a trust-region algorithm 
based on probabilistic models, i.e., models m k which are built in a random fashion. First we 
discuss these models and state what will be assumed from them. 



3.1 The probabilistically fully linear models 

For simplicity of the presentation, we consider only quadratic models, written in the form 

m k (x k + s) = m k (xk) + s T g k + -s T H k s, 

where g k = Vm k (x k ) and Hk = V 2 m,k{xk)- Our analysis is not, however, dependent on the 
models being quadratic. 

Let us start by introducing a measure of (linear or first order) accuracy of the model m k . 

Definition 3.1 We say that a function m k is a (K eg , n e f) -fully linear model of f on B(x k ,5 k ) 
if, for every s £ B(0,5 k ), 



II V/(xfe + s) - Vm k (x k + s) 
\\f(x k + s) -m(x k + s) 



< 
< 



K e f5\. 



The concept of fully linear models is introduced in [T3] and [15 j . but here we use the notation 
proposed in [3] . In [15^ Chapter 6] there is a detailed discussion on how to construct and maintain 
deterministic fully linear models. 

For the case of random models, the key assumption in our convergence analysis is that these 
models exhibit good accuracy with sufficiently high probability. We will consider random models 
M k , and then use the notation m k = M k {uj k ) for their realizations. The randomness of the 
models will imply the randomness of the current point x k and the current trust-region radius 5 k . 
Thus, in the sequel, these random quantities will be denoted by X k and A k , respectively, while 
x k = X k (uj k ) and 5 k = A k (uj k ) denote their realizations. 
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Definition 3.2 We say that a sequence of random models {Mk} is (p) -probabilistically (K eg , « e /)- 
fully linear for a corresponding sequence {B(X k ,A k )} if the events 

S k = {M k is a (K eg ,K e f) -fully linear model of f on B(X k ,A k )} 

satisfy the following submartingale-like condition 

PiSklFiU) > p, 

where F^_ l = cr(Mo, . . . , Mfc_i) is i/ie a-algebra generated by Mo, . . . ,Mfe_i. Furthermore, if 
P > 2' ^ en we sa 2/ ^ a ^ ^ e random models are probabilistically (K eg ,K e f) -fully linear. 

Note that M^ is a random model that encompasses all the randomness of iteration & of our 
algorithm. Definition 13.21 serves to enforce the following property: even though the accuracy of 
M k may be dependent of the history of the algorithm, (Mi, . . . , M k _i), it is sufficiently good with 
probability at least p, regardless of that history. We believe this condition is more reasonable 
than assuming complete independence of M k from the past, which is difficult to ensure given 
that the current iterate, around which the model is built, and the trust-region radius depend on 
the algorithm history. 

Now we discuss the corresponding assumptions on the models realizations that we use in the 
algorithm. The first assumption guarantees that we are able to adequately minimize (or reduce) 
the model at each iteration of our algorithm. 

Assumption 3.1 For every k, and for all realizations m k of M k (and of X k and A k ), we are 
able to compute a step s k such that 

I \ i , \ - K fcd\\ ii ■ ) \\9k\\ j- 1 /, s 

m k (x k ) -m k {x k + Sk) > -^— ||gfc||min^ ,d k } , (1) 

2 I ll^fcll J 

for some constant Kfd £ (0, 1]. We say in this case that s k has achieved a fraction of Cauchy 
decrease. 

The Cauchy step itself, which is the minimizer of the quadratic model within the trust region 
and along the negative model gradient —g k , trivially satisfies this property with Kf c d = 1. 
We also assume a uniform bound on the model Hessians: 

Assumption 3.2 There exists a positive constant Kbhm, such that for every k, the Hessians H k 
of all realizations m k of M k satisfy 

\\H k \\ < Kbhm- (2) 

The above assumption is introduced for convenience. While it is possible to show our results 
without this assumption, it is not restrictive in the case of fully linear models. In particular, 
one can construct fully linear models with arbitrarily small \\H k \\ using interpolation techniques. 
In the case of models that, fortuitously, have large Hessian norms, because they are not fully 
linear, we can simply set the Hessian to some other matrix of a smaller norm (or zero). 



3.2 Algorithm and basic properties 

Let us consider the following simple trust-region algorithm. 

Algorithm 3.1 Fix the positive parameters i}\, 772, 7, <5 max with 7 > 1. At iteration k ap- 
proximate the function f in B(x k ,8 k ) by mk and then approximately minimize mk in B(xk, 8k), 
computing Sk so that it satisfies a fraction of Cauchy decrease (pQ) . Let 

f(Xk) ~ f{Xk + s k ) 

Pk - 



m(xk) -m(x k + s k )' 

If Pk > Vi an d Ibfcll > 1728k, set x k +i = x k + s k and 8 k+ i = min{7<5 fc , 8 max }. Otherwise, set 
Xk+i = Xk and Sk+i = r )~ 1 8k- Increase k by one and repeat the iteration. 

This is a basic trust-region algorithm, with one specific modification: the trust-region radius 
is always increased if sufficient function reduction is achieved, that is the step is successful, 
and the trust-region radius is small compared to the norm of the model gradient. The logic 
behind this update follows from the fact that the step size obtained by the model minimization 
is typically proportional to the norm of the model gradient, hence the trust region should be of 
comparable size also. Later we will show how the algorithm can be modified to allow for the 
trust-region radius to remain unchanged in some iterations. 

Each realization of the algorithm defines a sequence of realizations for the corresponding 
random variables, in particular: m& = Mk(uJk), Xk = X k (u)k), 8k = Ak(uJk)- 

For the purpose of proving convergence of the algorithm to first order critical points, we 
assume that the function / and its gradient are Lipschitz continuous in regions considered by 
the algorithm realizations. To define this region we follow the process in [T3]. Suppose that xq 
(the initial iterate) is given. Then all the subsequent iterates belong to the level set 

L(x ) = {x£R n : f(x)<f(x )}. 

However, the failed iterates may lie outside this set. In the setting considered in this paper, all 
potential iterates are restricted to the region 

Leni(xo) = L(x )U [J B(x,8 m3iX ) = [J B(x,8 max ), 

x£L(xo) x£L(xo) 

where <5 max is the upper bound on the size of the trust regions, as imposed by the algorithm. 

Assumption 3.3 Suppose xq and <5 max are given. Assume that f is continuously differ entiable 
in an open set containing the set L en i(xo) and that V/ is Lipschitz continuous on L en i(xo) with 
constant n^g- Assume also that f is bounded from below on L{xq). 

The following lemma states that the trust-region radius converges to zero regardless of the 
realization of the model sequence {M/%} made by the algorithm, as long as the fraction of Cauchy 
decrease is achieved by the step at every iteration. 

71 



Lemma 3.1 For every realization of AlaorithmW. 



lim 8 k = 0. 

k— >oo 



Proof. Suppose that {5k} does not converge to zero. Then, there exists e > such that 
#{k : 5k > e} = oo. Because of the way 5k is updated we must have 

# I k : 5 k > -, 5 k+ i >5 k \ = oo, 

in other words, there must be an infinite number of iterations on which 5 k +i is not decreased, 
and, for these iterations we have p>r}\ and \\gu\\ > V2t:- Therefore, because ([1]) and ([2]) hold, 

f(x k ) - f(xk + 8k) > m (m(x k ) - m(x k + s k )) 

. «/cdi, n - / \\9k\\ x \ /on 

* I K bhm ) 

. Kfcd . \ m \ e 2 
2 [n bhm J 7 Z 

This means that at each iteration where 5k is increased, / is reduced by a constant. Since / is 
bounded from below, the number of such iterations cannot be infinite, and hence we arrived at 
a contradiction. □ 

Another result that we use in our analysis is the following fact typical of trust-region methods, 
stating that, in the presence of sufficient model accuracy, a successful step will be achieved, 
provided the trust-region radius is sufficiently small relatively to the size of the model gradient. 

Lemma 3.2 If rrik is (K eg , n e f) -fully linear on B(xk,5k) and 

llfffell K fcd (l - m)\\9k\ 



5k < min 



Kbhm 4^e/ 

then at the k-th iteration p k > 771 . 

The proof can be found in |15t Lemma 10.6]. 

4 Convergence of the first order trust-region method based on 
probabilistic models 

We now assume that the models used in the algorithm are probabilistically fully linear, and show 
our first order convergence results. First we will state an auxiliary result from the martingale 
literature that will be useful in our analysis. 

Theorem 4.1 Let Gk be a submartingale, i.e., a sequence of random variables which, for every 
k, are integrable (R(\Gk\) < 00 ) and 

E[G fc |if_!] > Gk-i, 

where Ff?_ 1 = o~(Gq, • • • , G k -i) is the a -algebra generated by Gq, ■ ■ ■ , Gk-i and EfG/d-F^J de- 
notes the conditional expectation of Gk given the past history of events F^_ 1 . 

Assume further that \G k — Gk-i\ < M < 00, for every k. Consider the random events 
C = {linifc^oo Gk exists and is finite} and D = {lim sup^^ Gk = 00}. Then P(C U D) = 1. 



Proof. The theorem is a simple extension of [16^ Theorem 5.3.1], see [16^ Exercise 5.3.1]. □ 

Roughly speaking, this results shows that a random walk with bounded increments and an 
upward drift either converges to a finite limit or is unbounded from above. We will apply this 
result to log A^ which, as we show, is a random walk with an upward drift that cannot converge 
to a finite limit. 

4.1 The liminf-type convergence 

As it is typical in trust-region methods, we show first that a subsequence of the iterates drive 
the gradient of the objective function to zero. 

Theorem 4.2 Suppose that the model sequence {M/%} is probabilistically (K eg ,K e f) -fully linear 
for some positive constants n eg and n e f. Let {X^} be a sequence of random iterates generated 
by Alaorithm \3.1\ Then, almost surely, 

lrminf||V/(X fc )|| = 0. 

fc— S-oo 

Proof. Recall the definition of events the S k in Definition 13.21 Let us start by constructing the 

following random walk 

k 

W k = ^(2 • l Si - 1), 

where lg t is the indicator random variable (lg i = 1 if Si occurs, lg. = otherwise). From the 
martingale-like property enforced in Definition I3.2|, it easily follows that Wk is a submartingale. 
In fact, one has 

nWklFl,} = E[Wfc_ 1 |F|_ 1 ]+E[2.1 Sfc -l|if_ 1 ] 

= W k _ 1 + 2ni Sk \F£_ 1 }-l 

= W k -i + 2P(S k \F^ 1 )-l 

> W^, 

where F^_-^ = a(ls , . . . , lfi' fc _ 1 ) is the a-algebra generated by lg , . . . , ls k -u m turn contained 
in F k M _ 1 = <j(Mq, . . . , Mfc_i). Since the submartingale W k has ±1 (and hence, bounded) in- 
crements it cannot have a finite limit. Thus, by Theorem 14.11 we have that the event D = 
{lim supfc^oo Wfc = oo} holds almost surely. 

Since our objective is to show that liminffc^oo ||V/(Xfc)|| = almost surely, we can show 
it by conditioning on an almost sure event. All that follows is conditioned on the event D = 
{lim supfc^oo W k = oo}. 

Suppose there exist e > and k\ such that, with positive probability, 

l|V/(*fc)ll > 6, 

for all k > k\. 

Let {xk\ and {5k} be any realization of {Xk} and {Aj.}, respectively, built by Algorithm 13. II 
By Lemma l3.lt there exists k2 such that we have V/c > ki 

5 k < b := mm <^ , , — - , -^— i '-, \ . 4 

[2K eg 2n bhm 2r\2 8k 6/ 7 J 
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Consider some iterate k > k$ := max{A;i, A^} such that lg h = 1 (model rrik is fully linear). Then, 
from the definition of fully linear models 



c 
2' 



l|V/(x fc ) -5 fc || < KegJfc < 

hence, 

\\Qk\\ > — • 

Using Lemma 13.21 we obtain /j^ > 771. Also 

lldfcll > g - ^ 2 ^- 

Hence, by the construction of the algorithm, and the fact that 5k < -"rrS we have 5k+i = j5k- 

Let us consider now the random variable Rk with realization r^ = log (h" 1 5k)- For every 
realization {rk} of {-Rfc} we have seen that there exists feo such that r^ < for fc > fco- Moreover, 
if ls fc = 1 then rfc+i = r^ + l, and if l,s< fe = 0, Tk+i > ^fc — 1 (implying that i?/% is a submartingale) . 
Hence, r& — r^ > w^. — w& (w/% denoting a realization of W^). Since we are conditioning on the 
event D, we have that Rk has to be positive infinitely often with probability one, contradicting 
the fact that for all realizations of {rk} of {Rk} there exists fco such that r& < for k > &o- 
Thus, conditioning on D we always have that liminffc^oo ||V/(Xfc)|| = with probability one. 
Therefore 

liminf||V/(X fc )|| = 

fc— >oo 

almost surely. □ 



4.2 The lim-type convergence 

In this subsection we show that lim/^oo ||V/(Xfc)|| = almost surely. Before stating and proving 
the main theorem we state and prove two auxiliary lemmas. 

Lemma 4.1 Let {Zk}keN be a sequence of non-negative uniformly bounded random variables 
and {Bk} be a sequence of Bernoulli random variables (taking values 1 and —1) such that 

P(B k = l\a(B 1 ,...,B k . 1 ),a(Z 1 ,...,Zk)) > 1/2. 

Let V be the set of natural numbers k such that Bk = 1 and M = N \ V (note that V and M are 
random sequences). Then 



Prob < > Zi < 00 > n < > Zt = 00 } \ = 0. 




Proof. Let us construct the following process Gk = Gk-i +BkZk- It is easy to check that Gk is a 
submartingale with bounded increments {BkZk}. Hence we can apply Theorem [ITT] and observe 
that the event {limsupfc^^ Gk = —00} has probability zero. Noting that Gk = YlieVi<k^i ~ 

Y^ieX,i<k Zi and nence {Yjiev Z i < 00 } n {Y^ieAf Z i = °°} implies that {limsup^^ G k = -00}, 
if the latter happens with probability zero, then so is the former. □ 



11 



Lemma 4.2 Let {Xk} and {A^} be sequences of random iterates and random trust-region radii 
generated by Algorithm \3. 11 Fix e > and define the sequence {K{\ consisting of the natural 
numbers k for which ||V/(Xfc)|| > e (note that Ki is a sequence of random variables). Then, 

Y^ A fc < oo 



ke{Ki} 



almost surely. 



Proof. Let {m^}, {xk}, {6k}, {k{\ be realizations of {M^}, {Xk}, {A/;}, {Ki} respectively. Let 
us separate {hi} in two subsequences: {pi} is the subsequence of {k{\ such that m Pi is (K eg , K e f)- 
fully linear on B(x Pi , S pi ), and {n^} is the subsequence of the remaining elements of {ki}. 

We will now show that 5^.- g r p > $j < °° f° r an y such realization. If {pj} is finite, then this 
result trivially follows. Otherwise, since 6k — > 0, we have that for sufficiently large pi, 6 Pi < b, 
with b defined by (jl]). Since ||V/(x Pi )|| > e, and m Pi is fully linear on B(x Pi , A Pi ), then by the 
derivations in Theorem 14.21 we have 1 |<7„„. 1 1 > £, and by Lemma l3.2| p p . > 771. Hence, for all pi 



\dpi\\ > §) and by Lemma | 
large enough, the decrease in the function value satisfies 



Thus 






< 



4(/(ao) ~ A) 

m K fcdt 



< 00, 



where /* is a lower bound on the values of / on L{xq). 

For each ki, the event S^ (whether the model is fully linear on iteration ki) has probability at 
least 2 conditioned on all of the history of the algorithm. Hence, we can apply Lemma [4. II (note 
that {Afc} is a sequence of non- negative uniformly bounded variables and S^ are the Bernoulli 
random variables) and obtain 



Prob I < ^2 A j < 00 




oc 



0. 



This means that, almost surely, 






A; 



E 

je{Pi} 



A,- + 



E^ 

i6{iVi} 



< 00. 



We are now ready to prove the lim-type result. 



□ 



Theorem 4.3 Suppose that the model sequence {Mk} is probabilistically (K eg ,K e f) -fully linear 
for some positive constants K eg and K e f. Let {Xk} be a sequence of random iterates generated 
by Alaorithm \3.1\ Then, almost surely, 



lim \\Vf(Xk) 

fe— >oo 
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Proof. Suppose that lim/ £ _ s . 0O ||V/(-Xfc)|| = does not hold almost surely. Then, with positive 
probability, there exists e > such that ||V/(Xfc)|| > 2e, holds for infinitely many k. Without 
loss of generality, we assume that e = — , for some natural number n e . 

Tig 

Let {Ki} be a subsequence of the iterations for which ||V/(Xfc)|| > e. We are going to show 
that, if such an e exists then YljeiK} A? * s a divergent sum. 

Let us call a pair of integers (W',W") an "ascent" pair if < W < W", \\Vf(X w >)\\ < e, 
\\Vf(X w , +1 )\\ > e, \\Vf(X w »)\\ > 2e and, moreover, for any w G (W',W"), e < \\Vf(X w )\\ < 
2e. Each such ascent pair forms a nonempty interval of integers {W' + l, . . . , W"} which is a sub- 
set of the sequence {Ki}. Since liminffc^oo ||V/(J\fc)|| = holds almost surely (by Theorem l4.2|) . 
it follows that there are infinitely many such intervals. Let us consider the sequence of these 

intervals {(W^, W")}. The idea is now to show (with positive probability) that, for any ascent 

w" — 1 
pair (Wj,,W") with I sufficiently large, E =V"+i A? * s un if° rrm y bounded away from (and 

W f/ — 1 

hence WJ. + K W t '), which implies that Eje{KJ A j = °° since E^ E j= V;+i A i ^ £je{* 4 } A i' 
because the sequence {Ki} contains all intervals {W^W"}. 

Let {xk} and {5^} be realizations of {X^} and {A/;}, for which ||V/(xfc)|| > e for k G {ki}. 
By the triangular inequality, for any j, 



e < 



llv/^ii-ilv/^OII < E lllv/(^-)ll-l|v/(a; i+1 ; 

j=w' t 



Since V/ is Lipschitz continuous (with constant KLg), 

e < J2 H|V/(^)||-||V/(x i+1 )||| (5) 

j=w' i 

w'l-l 

K Lg Yl \\ x J- x 3+l\\ ( 6 ) 



< 



3=w, 



< * Lg \5 W ,+ J2 S J • ( 7 ) 

From the fact that 5^ converges to zero, then, for any (. large enough, 6 w i < 2 e , and hence 

Eji^+i Sj > § > 0, which gives us E je{fcl } fy = 00. 

We have thus proved that if, lim^oo ||V/(Xfc)|| = does not hold almost surely, then, 
with positive probability, there exists n e such that {Ki} defined as above based on n e , satisfies 

On the other hand, Lemma r4. 21 guarantees that, for every n e , the probability of Ei's-fic j A? = 
00 is zero. Since there are countable n t G N and since the union of a countable number 
of rare events is still rare we have that the probability of existence of a value n e for which 
Eie-Oc > A? = 00 is zero, which contradicts the initial assumption that lim^oo ||V/(-Xjfc)|| = 
does not hold almost surely. □ 
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4.3 Modified trust-region schemes 

The trust-region radius update of Algorithm 13,11 may be too restrictive as it only allows for 
this radius to be increased or decreased. In practice typically two separate thresholds are used, 
one for the increase of the trust-region radius and another for its decrease. In the remaining 
cases the trust-region radius remains unchanged. Hence, here we propose an algorithm similar 
to Algorithm 13 . 1 1 but slightly more appealing in practice. 

Algorithm 4.1 Fix the positive parameters r\\, 772, 773, 7, <5 max , with 7 > 1 and 773 < 772. At 
iteration k approximate the function f in B(x k , o~k) by m k and then approximately minimize m k 
in B(x k ,5 k ), computing Sk so that it satisfies a fraction of Cauchy decrease ((!]). Let 

f(xk) ~ f{xk + s k ) 

Pk - 



3fc+i 



m(x k ) -m(x k + s k )' 
If Pk > 7/1; then set x k +i = x k + s k and 

7 -1 ^fc if llfffcll < m<>k, 

Sk if m^k < Hsfcll < V2h, 

, min{74,(J max } if rj 2 S k < \\g k \\. 

Otherwise, if p k < r}\, set x k+ \ = x k and 5 k+ \ = r )^ 1 5 k . 

It is straightforward to adapt the proofs of Lemma 13.11 and Theorems 14.21 and 14.31 to show 
the convergence for this new algorithm. Additionally, one can consider two different thresholds, 
< ?7o < 1 for decrease of the trust region radius, and 771 > ?7o for the increase of the trust 
region radius. 

5 Second order trust-region method based on probabilistic mod- 
els 

In this section we present the analysis of the convergence of a trust-region algorithm to second 
order stationary points under the assumption that the random models are likely to provide 
second order accuracy. 

5.1 The probabilistically fully quadratic models 

Let us now introduce a measure of second order quality or accuracy of the models m k (see |14| 
[T5| g] for more details). 

Definition 5.1 We say that a function m k is a (K e h, K eg , n e f) -fully quadratic model of f on 
B(x k ,5 k ) if for every s G B(0,S k ), 

||V 2 /(Xfc + s) -fljfc|| < KehO~k, 

||V/(x fc + s) - S/m k (x k + s)\\ < n eg 5l, 
\\f(x k + s)-m{x k +s)\\ < K ef 5l. 
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As in the fully linear case, we assume that the models used in the algorithms are fully 
quadratic with a certain probability. 

Definition 5.2 We say that a sequence of random models {M^} is (j>) -probabilistically {n e h, K eg , K e f)- 
fully quadratic for a corresponding sequence {B(Xj-, A&)} if the events 

Sk = {Mfc is a (k^, n eg , K e f)-fully quadratic model of f on B(Xk,Ak)} 

satisfy the following submarting ale-like condition 

P(S k \F£U) > p, 

where Fj^_ x = o-(Mq, . . . ,Mk-\) is the a-algebra generated by Mo, . . . ,Mfe_i. Furthermore, if 
p> 2> then we say that the random models are probabilistically (K e h, K eg , K e f)-fully quadratic. 

We now need to discuss the algorithmic requirements and problem assumptions which will be 
needed for global convergence to second order critical points. In terms of problems assumptions 
we will need one more order of smoothness. 

Assumption 5.1 Suppose xq and <5 max are given. Assume that f is twice continuously differ- 
entiate in an open set containing the set L en i(xo) and that V 2 / is Lipschitz continuous with 
constant n^h and that ||V 2 /|| is bounded by a constant k^j on L en i(xo). Assume also that f is 
bounded from below on L(xq). 

We will no longer assume that the Hessian H k of the models is bounded in norm, since we 
cannot simply disregard large Hessian model values without possibly affecting the chances of 
the model being fully quadratic. However, a simple analysis can show that ||-fffc|| is uniformly 
bounded from above for any fully quadratic model nik (although we may not know what this 
bound is and hence may not be able to use it in an algorithm). 

Lemma 5.1 Given constants K eg , n e f, K e h, and 5 max , there exists a constant n^mh such that 
for every k and every realization ra k of Mk which is a (K eg , n e f, K e h)- fully quadratic model of f 
on B(xk,d~k) with Xk G ^(^o) and 8k < (5mm we have 

|| -"fell ^ K bmh- 

Proof. The proof follows trivially from the definition of fully quadratic models and the assump- 
tion that ||V 2 /|| is bounded by a constant k^j on L en [(xo). □ 

It will also be necessary to assume that the minimization of the model achieves a certain 
level of second order improvement (an extension of the Cauchy decrease). 

Assumption 5.2 For every k, and for all realizations my* of M^ (and of X^ and A^), we are 
able to compute a step Sk so that 



m k (x k ) - m k (x k + Sk) > -{7- max { llSfcll min 



9k\\ x 
,o k 



H. 



,max{-X miR (H k ),0}6l\ . (8) 



for some constant Kf d £ (0, 1]. We say in this case that Sk has achieved a fraction of optimal 
decrease. 
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A step satisfying this assumption is given, for instance, by computing both the Cauchy 
step and, in the presence of negative curvature in the model, the eigenstep, and by choosing 
the one that provides the largest reduction in the model. The eigenstep is the minimizer of the 
quadratic model in the trust region along an eigenvector corresponding to the smallest (negative) 
eigenvalue of H k . 

The measure of proximity to a second order stationary point for the function / is slightly 
different from the traditional, and is given by 

l|V/(x)|| 



T(X 



max < mm 



l|V/(x)||, 



l|V 2 /(x)|| 

The model approximation of this measure is defined similarly: 

|Vm(x)|| 



AmintV 2 /^)) 



r m (x) 



max < mm 



|Vm(x)||, 



A m in(V 2 m(o;)) 



||V 2 m(x)||. 

We consider the additional terms ||V/(x)||/||V 2 /(x)|| and ||Vm(x)||/||V 2 m(x)|| given that we 
no longer assume a bound in the model Hessians as we did in the first order case. We show now 
that t(x) is Lipschitz continuous under Assumption ^. 11 

Lemma 5.2 Suppose that Assumption I5.il holds. Then there exists a constant kl t such that 
for all xi,x 2 € L en i(x ) 

|t(xi) -t(x 2 )| < klt\\xi -x 2 ||. (9) 

Proof. First we note that under Assumption 15.11 there must exist an upper bound Kbf g > on 
the norm of the gradient of /, ||V/(x)|| < Kbf g for all x £ L en i(xo). 

Then let us see that h(x) = min{||V/(x)||, ||V/(x)||/||V 2 /(x)||} is Lipschitz continuous. 
Given x,y S L en i(xo), one consider four cases: (i) The case ||V 2 /(x)|| > 1 and ||V 2 /(y)|| > 1 
results from the Lipschitz continuity and boundedness above of the gradient and the Hessian, 
(ii) The case ||V 2 /(x)|| < 1 and ||V 2 /(y)|| < 1 results from the Lipschitz continuity of the 
gradient, (iii) The argument is the same for the other two cases, so let us choose one of them, 
say ||V 2 /(x)|| < 1 and ||V 2 /(y)|| > 1. In this case, using these inequalities, one has 



\h(x)-h(y)\ < 



l|V/(x)|| 



l|V/(y)|| 



l|V 2 /(y)|| 
< ||V/(x 



< 



l|V/(x)|| 

v 2 /(y)ll 



l|V/(x)|| 



l|V 2 /(y)|| 
l|V 2 /(x)|| 



+ 



K L g\ 



l|V 2 /(y)|| 

+ KLg\\x-y\\ 



Thus, \h(x) - h(y)\ < (K bfg K Lh + K Lg )\\x - y\\. 

The proof then results from the fact the maximum of two Lipschitz continuous functions 
is Lipschitz continuous and the fact that eigenvalues are Lipschitz continuous functions of the 
entries of a matrix. □ 

The following lemma shows that the difference between the problem measure t(x) and the 
model measure r m (x) is of the order of 5 if m(x) is a fully quadratic model on B(x,5) (thus 
extending the error bound on the Hessians given in Definition l5.ip . 



Lemma 5.3 Suppose that Assumption \5. 1\ holds. Given constants K eg , n e f, K e h, and <5 max there 
exists a constant K eT such that for any m^ which is (n eg , K e f, K e h) -fully quadratic model of f on 
B(xk,5k) with Xk £ L(xq) and 5k < <5 max we have 

\T(x k )-T m (x k )\ < K eT 5 k . (10) 
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Proof. From the definition of fully quadratic models and the upper bounds on ||V/|| and ||V 2 / 1| 
on L en i(xo), we conclude that both ||Vm(xfc)|| and ||V 2 m(xfc)|| are also bounded from above 
with constants independent of x k and 5 k . 

For a given x k several situation may occur depending on which terms dominate in the 
expressions for r(x k ) and T m (x k ). In particular, if ||V 2 /(xfc)|| < 1 and ||V 2 m(xfc)|| < 1, then 



r{x k ) 



max 



{||V/(x fc )||,-A min (V 2 /(x fc ))} 



and 



T m {x k ) = max{||Vm(x fc )||,-A min (V 2 m(x fc ))} 

and the proof of the lemma is the same as in the case of the usual criticality measure, analyzed 
in |15j . Let us consider the case, when ||V 2 /(x/ c )|| > 1 and ||V 2 m(xfc)|| > 1. From the fact that 
m k is (n eg , n e f , K e h)-fully quadratic we have that 



|Vm(x fc )|| ||V/(x fc )|| 



|V 2 m(x fc )|| ||V 2 /(x fc )|| 



< |||Vm(x fc )||||V 2 /(x fc )|| - ||V/(x fe )||||V 2 m(x fc )||| 

< \\V 2 f(Xk)\\K e fSl + \\S/f(x k )\\K eh 5k < K er 5 k , 



for some large enough K er , independent of x k and 5 k . 
The other two cases that need consideration are 



T m (Xfe) 



|| Vm(i) 



y> m {x k )\\ » T ( Xfc ) = ll V /(^fc)l|, and 
r{*k) = |S,r m M = ||Vm(x fc )||. 



Let us consider the first case 



\r(x k )-r m (x k )\ 



l|V/(x fc )|| 



|Vm(x fc )|| 



|V 2 m(xfc) 



< 



l|V/(x fc )|| 



l|V/(x fc )|| 



IV 2 



m{x k ) 



+ 



K eg b k 



|V 2 m(xfc) 



< 



||V/(x fc )||(||V 2 m(x fc )||-l + Ke9 ^) < \\Vf(x k )\\(K eh 5 k + K eg 5 2 k ) < K eT 5 k , 



for some large enough k £t , independent of x k and 5 k . The proof of the second case is derived in 
a similar manner. Combining these results with standard steps of analysis, such as the one in 
in [15] we conclude the proof of this lemma. □ 

Let us now define T k = r(x k ) and tJP = T m (x k ). From the assumption that V 2 /(x) 
is bounded on L en ;(xo), it is clear that if r k — > (when k — > oo), then Vf(x k ) — > and 
max{— A m i n (V 2 /(xfc)), 0} — > 0. We next present an algorithm for which we will then analyze 
the convergence of r k . 



5.2 Algorithm and liminf-type convergence 

Consider the following modification of Algorithm 13.11 



Algorithm 5.1 Fix the positive parameters rji, rj2, 7, with 7 > 1. At iteration k approximate 
the function f in B(x k , 5 k ) with m k and then approximately minimize m k in B(x k , 5 k ), computing 
s k so that it satisfies a fraction of optimal decrease (J5J) . Let 



Pk 



f{xk) ~ f(x k + s k ) 
m(xfc) -m(x k + s k ) 
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If Pk > Vi an d Tfc > mh, set x k +i = x k + s k and 4+1 = min{74,£ max }. Otherwise, set 
Xfc + i = x k and 5k+\ = 7~ 1 <5/ . Increase k by one and repeat the iteration. 

The analysis of this method is similar to that of the first order method described in Section [3l 
The main difference lies in a replacement of the use of assumptions and in the lack of proof of 
the lim-type result. First, we will follow the steps of Section [3] to analyze the behavior of the 
trust-region radius. 

Lemma 5.4 For every realization of Alaorithm \5. 1\ 

lim 5k = 0. 

k— >oo 

Proof. Suppose that {5k} does not converge to zero. Then, there exists e > such that 
#{k : 5k > e} = oo. We are going to consider the following subsequence {k : 5k > #:, <5fc+i > 5k}- 
By assumption this subsequence is infinite and due to the way 5 k is updated we have t™ > 7/2 j- 
for each k in this subsequence. 

First assume that min{||<7/%||, Mr 1 } > 772 — ■ Therefore, from ([8|) we have 

f{xk)~ f{xk + s k ) > m (m(x k ) - m(x k + s k )) 

. K fod u n . / \\9k\\ x 

2 I ll-"fc|l 

2 

> r/i^-T/2^min{r/2,l}. 

2 7 Z 

Now assume that —Xmin(Hk) > ^2^- Therefore, from (|8|) we have 

f{xk) ~ f(xk + s k ) > rji (jn(x k ) - m(x k + s k )) 

> -m^KUHk)st 

Kfod e 3 

> 771^5-772-3- 

This means that at iteration k the function / decreases by an amount bounded away from 
zero. Since we have assumed that there is an infinite number of such iterations, we obtain a 
contradiction. □ 

The next step is to extend Lemma 13.21 to the second order context. 
Lemma 5.5 If m k is (w e /i, K eg , K e f)-fully quadratic on B(xk,5 k ) and 



5 k < min {tJJ 1 , 



then at the k-th iteration pk > rji. 



«/od(l - Vi)i~r Kfodil - rn) 1 k 






4k 6/ 4k 6 / 



The proof is a trivial extension of the proof of [15} Lemma L0.17] taking into account our 
modified definition of rl™ . 

We can now prove the following convergence result which states that a subsequence of iterates 
approaches second order stationarity almost surely. 
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Theorem 5.1 Suppose that the model sequence {M^} is probabilistically (n e h, K eg , n e f) -fully 
quadratic for some positive constants K e h, n eg , and K e f. Let {Xk} be a sequence of random 
iterates generated by Alaorithm \5.1[ Then, almost surely, 

lim inf r^ = 0. 

k— >oo 

Proof As in Theorem 14.21 let us consider the random walk Wk = Y2i=o@ ' ^-Si ~ -0 ( wnere 1& 
is the indicator random variable, now based on the event Si of Definition 15. 2\i . All that follows 
is also conditioned on the almost sure event D = {lim sup^^ Wk = oo}. 

Suppose there exist e > and k\ such that, with positive probability, Tf. > e, for all k > k\. 
Let {xk} and {8k} be any realization of {Xk} and {A^}, respectively, built by Algorithm 15.11 
From Lemma l5,l| there exists &2 such that we have "ik>k 2 




A k • : b := min I — , -, — , "M 1 ~ *> , "/«*(! ~ ^ *=« I > . (11) 

Let k > ko := maxjfci, /C2} such that ls k = 1. Then, |rfc — t™\ < n eT 5k < |, and thus t™ > |. 
Now, using Lemma \5.5\ we obtain p k > 771. We also have r™ > | > r/2^. Hence, by the 
construction of the algorithm, and the fact that 5k < -^f L , we have 8k+i = jd~k- 

The rest of the proof is derived exactly as the proof of Theorem 14.21 (defining the random 
variable Rk with realization r^ = log 7 (b _1 5fc), but with b now given by (fTTp ). Conditioning 
on D we obtain lim inf k^oo T k = 0, and thus lim inf fc-^oo r^ = almost surely. □ 



5.3 The lim-type convergence 

Let us summarize what we know about the convergence of Algorithm EUJ Clearly all results that 
hold for Algorithm 13.11 also hold for Algorithm 15. 1\ hence as long as the probabilistically fully 
linear (or fully quadratic) models are used, almost surely, the iterates of Algorithm 15.11 form a 
sequence {xfc}, such that V/(x^) — > as k — > 00, in other words, the sequence {xk} converges 
to a set of first order stationary points. Moreover, as we just showed in the previous section, 
as long as the probabilistically fully quadratic models are used, there exists a subsequence of 
iterates {xk} which converges to a second order stationary point with probability one. Note that 
under certain assumptions, for instance, assuming that the Hessian of f(x) is strictly positive 
definite at every second order stationary point, we can conclude from the results shown so far 
(and similarly to Theorem 6.6.7]) that, almost surely, all limit points of the sequence of 
iterates of Algorithm 15.11 are second order stationary points. 

There are however cases, when the set of first order stationary points is connected, and 
contains both second order stationary points and points with negative curvature of the Hessian. 
An example of such a function is 

f(x) = xy 2 . 

All points such that y = for a set of first order stationary points, while any x > gives us 
second order stationary points, while x < does not. In theory our algorithm may produce 
two subsequences of iterates, one converging to a point with y = and x > (a second order 
stationary point), and another converging to a point for which y = and x < (a first order 
stationary point with negative curvature of the Hessian) . 
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Theorem 6.6.8 in [8] shows that all limit points of a trust-region algorithm are second order 
stationary without the assumption on these limit points being isolated, but under the condition 
that the trust-region radius is increased at successful iterations. The results in [TS] show that 
all limit points of a trust-region framework based on deterministic fully quadratic models are 
second order stationary under a slightly modified trust-region maintenance conditions. While 
the same result may be true for Algorithm 15.11 using probabilistically fully quadratic models, we 
were unable to extend the results in |15j to this case. Below we present explanations where such 
extension fails, but the key lies in the fact that successful iterations and hence increase in the 
trust region are no longer guaranteed. 

Conjecture 5.1 Suppose that the model sequence {M^} is probabilistically (K e h, n eg , K e f) -fully 
quadratic for some positive constants K e h, K eg and K e f. Let {X^} be a sequence of random 
iterates generated by Algorithm \5.1\ Then, almost surely, 

lim Tfc = 0. 

k— >oo 

Let us attempt to follow the same logic as in the proof of Theorem 14.31 The first part of the 
proof applies immediately after substituting ||V/(x)|| by t(x) wherever is appropriate. 

Indeed, suppose that limfc^ 00 r(Xfc) = does not hold almost surely. Then, with positive 
probability, there exists e > such that r{Xk) > 2e, holds for infinitely many A;'s. Without loss 
of generality, we assume that e = — , for some natural number n e . 

Let {Ki} be a subsequence of the iterations for which r(Xfc) > e. We are going to show that, 
if such an e exists then X)ie{-K'j A? ^ s a divergent sum. 

Let us call a pair of integers (W',W") an "ascent" pair if < W < W", t(X w ,) < e, 
t{X\y'+i) > e, t(X\y") > 2e and, moreover, for any w € (W',W"), e < r{X w ) < 2e. Each such 
ascent pair forms a nonempty interval of integers {W' + 1, . . . , W"} which is a subset of the 
sequence {Ki}. Since lim inf ^oo t{XjS) = holds almost surely (by Theorem 15. ip . it follows 
that there are infinitely many such intervals. Let us consider the sequence of these intervals 
{{Wg,W")}. The idea is now to show (with positive probability) that, for any ascent pair 

{W'^W'D with £ sufficiently large, J2-=w'+i^3 ^ s un if° rm b7 bounded away from (and hence 

W"—l 

W'z + 1 < W t ), which implies that Eje{^} A i = °° since E^ E j= V;+i A i - ^je{K,} A i> 

because the sequence {Ki} contains all intervals {W^,W"}. 

Let {xk} and {5k} be realizations of {X^} and {A^}, for which rj, > e for k £ {h}. By the 
triangular inequality, for any j, 



T 

.r- 



e < T w' e -T w '; < 2^ l r i- T i+il 
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Since r(x) is Lipschitz continuous (with constant Klt), 

e < Y, l^-^+il ( 12 ) 

w'l-l 

- KLt ^2 W X J "^i+ill ( 13 ) 

(14) 
From the fact that <5fc converges to zero, then, for any £ large enough, 5 w i < oif— , an d hence 

U)"-l 

Ejiwj+i 5 i > f > 0, which gives us E je{fcl } *j = oo. 

We have thus proved that if, limfe_ ) . (X) r(Xfc) = does not hold almost surely, then, with 
positive probability, there exists n e such that {Ki} defined as above based on n e , satisfies 

Ei6{JCi} A i = °°- 

The second part of the proof should rely on providing the contradiction to the statement that 

^.g/jj-.-i Aj = oo can happen with positive probability. In the case of Theorem 14.31 the proof 
utilized the fact that the sum of all Sj, for all j G {p{\ such that rrij is probabilistically fully 
linear, has to be finite, because it appears as a term in the lower bound on the total decrease 
of the objective function (see Lemma |4.2|) . However, in the second order case the total decrease 
of the objective function is bounded from below by a factor of Ejefp } $j- Hence it is possible 
that Yljefk-} fij < °°' w hile Yljeik} fy = °°- ^ n * ne deterministic case the proof of the fact that 
Y2 jeik} ^3 "^ °° renes on the trust-region maintenance strategy. In particular if the model rrij 
is fully quadratic for every j E {ki} then the trust-region radius is increased at each iteration 
j G {h}. In other words, for large enough £, 5 w ' +2 = jA w / + i and so on until 5 tu »„ 1 = r y5 w "_ 2 - 
Hence 

2 ^ < M8 w n_ x . (15) 

Ul"-1 

This then implies that Yl -Lm'+x Sj —^ bs £ —^ because 5 w n_ 1 — > 0. 

The difficulty in the probabilistic case comes from the fact that the trust region only increases 
with some high probability, but not necessarily at each iteration. In general it is possible to 
construct examples of random walks, which satisfy the conditions on the maintenance of 5j, but 
for which with positive probability there exist arbitrarily large indices w'^ and w'l, such that 
w'l — w' e is arbitrarily large, and such that (fT5]) does not hold. In other words, under the current 
conditions we are not able to show that (|15p holds with probability one for all I, hence we cannot 
prove the conjecture. The proof can be established under the additional assumption that the 
probability of occurrence of fully quadratic models increases in some cases as the algorithm 
progresses. However, this scenario essentially leads to deterministic schemes and hence is of no 
interest for this paper. 
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6 Examples of probabilistically fully linear and fully quadratic 
models in DFO 

In the previous section we described an algorithmic framework that is based on models whose 
approximation quality is random and is sufficiently good with probability more than 1/2 (con- 
ditioned on the past). We called such models probabilistically fully linear or fully quadratic, 
depending on the quality of approximation that they provide. In this section, we discuss how 
such models can be generated (for some large enough values of the k constants) and outline 
future research in this direction. 

6.1 Fully linear and fully quadratic polynomial interpolation models and A- 
poised sample sets 

Let V% denote the set of polynomials of degree < d in R n and let gi = q + 1 denote the 
dimension of this space. It clear that the dimension of V\ is q± = n + 1 and the dimension of V\ 
is q\ = i(n + l)(n + 2). A basis <3? = {4>o(x), 4>\(x), . . . , 4> q (x)} for V 1 ^ is a set of q\ polynomials 
of degree < d that span V^. For any such basis $, any polynomial m(x) £ V% can be written as 



m(x) = ^2aj<pj(x), 



3=0 

where the Oj-'s are real coefficients. Given a set of p\ = p + 1 points Y = {yo,yi, 
m(x) is said to be the interpolation polynomial of f(x) on Y if it satisfies 



M($,Y)a = f(Y), 



(16) 



jpKr, 



(17) 



where M($, Y) is defined as follows 



M($,Y) 



My ) 
My 1 ) 



My ) 
My 1 ) 



My p ) My p ) 



My ) 
My 1 ) 



My p 



;is) 



and f(Y) is the p\ dimensional vector whose entries are f(yi) for i = 0, . . . ,p. The interpolation 
polynomial m(x) exists and is unique if and only if p = q and the set Y is poised [27], which 
essentially means that M(<I>, Y) is nonsingular. When the number of points p\ is smaller than the 
number of elements in $ the matrix M(<i>, Y) has more columns than rows and the system (|17p 
is under deter mined. In this case there are several choices of interpolating polynomials, which we 
will discuss later. If, on the other hand, p > q, then the system ()17p is overdetermined and one 
can apply least squares regression instead of interpolation. Other polynomial approximations 
are also possible. If Y is such that the condition number of M(&,Y) is bounded by A, where 
Y = {(yo — Xfc)/A, . . . , (y p — x^/A.) is a scaled version of Y, then we say that Y is A-poised 
(see |15j ) . It is shown in [12|. [13j [15] that if Y is A-poised and p\ > n + 1, then one can build 
a model which is (n e f, K eg )-iul\y linear, with K e f and K eg both equal to 0(pA). Analogously, it 
is shown that if p\ > (n + l)(n + 2)/2, then one can also build a (n e f, K eg , fte/J-fully quadratic 
polynomial model with n e f, n eg , and n e h equal to 0(pA). Hence, to build a fully quadratic model 
in n dimension one may require (n + l)(n + 2)/2 sample points (within reasonable proximity 
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of the current iterate Xfe). If such a sample set is already available, estimating the condition 
number of the matrix M(<1>, Y) may require up to 0(n 6 ) arithmetic operations. This dependency 
on the dimension limits the use of fully quadratic models to small dimensional problems. 

There are two main ways to improve the per-iteration complexity of DFO algorithms. One 
approach, to only change the sample set by one point at a time, has been very successful in 
practice, as it not only reduces the number of function evaluations, but also the linear algebra 
involved [HI [TBI [29l [35] . However, in [31] is was shown that such algorithms still require comput- 
ing a A-poised set in the criticality step of the trust-region framework. Hence computation of n 
new sample points is required if fully linear models are used, while for fully quadratic models 
(n + l)(n + l)/2 — 1 new sample points have to be evaluated. 

The other, complementary, approach is to use quadratic models based on fewer than (n + 
l)(n + 2)/2 sample points, which also reduces both the cost of the linear algebra and the number 
of function evaluations. In practical DFO applications, incomplete quadratic models have been 
used very successfully. 

Let Y = {yo,yi, ■ ■ ■ ,y P } be a set ofp+1 sample points withp < q and let $ = {l,x±,X2, ■ ■ ■ ,x n , 
x\/2,x\X2, ■ ■ ■ ,x n -ix n x^ l /2}, The interpolating polynomial for / on the set Y is given by (USD, 
where a satisfies the undetermined interpolation system (jTTJ) . Since this system admits multiple 
solutions we have some freedom in selecting a. In |11[ [T5l [35] the minimum Frobenius norm 
(MFN) models are considered, i.e., the models for which the Frobenius norm of the Hessian, or 
1 1 ckq 1 1 2 = ||(«n+i>ttn+2, • • • , «<?) H2, is minimized subject to (fTT[) . In [29] Powell selects the model 
based on minimizing the Frobenius norm of the update of the Hessian. Both these methods are 
successful in practice, and provide useful second order information. However, so far theoretically 
they are not shown to be superior to simple linear models. Indeed as we will show in the example 
below the MFN models may be nearly as bad as simple linear models, but the use of random 
sample sets can provide a significant practical improvement in this case. 

6.2 Random sample sets 

In the cases when function evaluations are not very expensive or can be obtained in parallel, 
there is less incentive to reuse old sample points for model building, because ensuring the model 
quality can become the bottleneck of the computations. Instead, one can simply use well-poised 
deterministic sample sets, chosen in advance. However this is not always the best approach, 
because the pattern is chosen without any consideration for the shape of the function and may 
be a very poor fit. In Section [2] we have seen examples where random directions provided 
better decrease on average than those from a fixed pattern. Similarly, random sample sets can 
automatically provide good quality models with high enough probability, yet they do not suffer 
from the worst case behavior of the deterministic sample sets. We consider another example. 

Example of comparison of using underdetermined quadratic models based on ran- 
dom and deterministic sample sets. Let us consider the function f(x) = 10(a?2 ~~ x i) 2 + 
(1 — xi) 2 , which is a version of the Rosenbrock function used in Section [21 but with smaller 
curvature and Hessian condition number. We apply a trust-region method [3] to this function 
with models based on 5 points at each iteration and construct MFN models based on the sample 
sets. Note that a fully quadratic model requires 6 points. In one case we choose deterministic 
models with the sample set selected as the current iterate plus the coordinate steps of length 5, 
i.e., Y = {y°, y° ± 5e±,y° ± <5e2J — a very well poised set. In other words, the set Y is generated 
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Iter. # 


success 


/ value 


A 


P 


1687 





+3.6742071 le-04 


+3.12e-02 


-1.66e+00 


1688 


1 


+3.67418778e-04 


+6.25e-02 


+8.13e+02 


1689 





+3.67418778e-04 


+3.12e-02 


-1.66e+00 


1690 


1 


+3.67409693e-04 


+6.25e-02 


+3.92e+03 


1691 





+3.67409693e-04 


+3.12e-02 


-1.66e+00 


1692 


1 


+3.67407812e-04 


+6.25e-02 


+8.34e+02 


1693 





+3.67407812e-04 


+3.12e-02 


-1.66e+00 


1694 


1 


+3.67398959e-04 


+6.25e-02 


+4.03e+03 


1695 





+3.67398959e-04 


+3.12e-02 


-1.66e+00 



around the current iterate y° by adding coordinate steps of size 5. For the second method we 
generate the set Y by picking 4 random points in a ball of radius 5 around the current iterate. 
The results are as follows: the method based on deterministic sample sets achieved the final 
function value of 10 -4 in 8500 function evaluations, while the method based on random sample 
sets achieved the function value of 10 -6 in 2700 function evaluations. Clearly, using random 
sample sets enhances the performance of the MFN models here. In particular, one can observe 
the slow progress of the deterministic method in Table HOI which represents iteration output. It 
is clear from the table that iterations follow a pattern (which starts at around iteration 1000) 
where 5k increased and decreased according to alternating successful and unsuccessful steps, 
while the progress is slow overall. 



Analysis of poisedness of random sample sets. Let us consider a sample set Y = 
{0, y 1 , . . . ,y p } C W 1 with a fixed point at the origin and the remaining n points being gen- 
erated randomly from a standard Gaussian distribution centered at the origin. Let us consider 
<]?(x) = {l,x\,X2, ■ ■ ■ ,x n }. Hence M(3>, Y) is simply a matrix whose first column is all l's, the 
first row is zero except the first element and the remaining p x n matrix is a Gaussian random 
matrix. Under a simple transformation, the condition number of M($, Y) is equal to the condi- 
tion number of a random Gaussian p x n matrix. From results in random matrix theory [T7] 
we have the following bound 



P(cond(M($,Y)) > A) < C(n,p) 



1 



A}n— p\+l 



where C(n,p) is a constant dependent on p and n. In particular, for p = n the result in [7] 

implies 

1 Cn 
P(cond(M($,Y)) > A) < ~^^f- 

where C is a universal constant smaller than 6.5. This result implies that given n and p, there 
exists A large enough such that P(cond(Af (<&, Y) < A) > ^. Hence there exist constants K e f 
and K eg such that the linear interpolation (or regression) polynomials based on Gaussian sample 
sets are probabilistically (K e /> K e9 )-fully linear. 

A more complicated, but important class of models are the quadratic models based on 
sample sets ofp+l>n + l sample points. In this case the basis <£ is constructed from first 
and second order polynomials and M($, Y) no longer has the simple structure of a Gaussian 
matrix. Matrices of this form have been studied in 1 30 1 and arc referred to as structured random 
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matrices. The bounds derived in [30] show that the condition number of M($, Y) is small with 
sufficiently high probability if p is large enough (but still scales pseudo- linearly with the number 
of columns in M($, Y)). We believe that these results can be used to show that for a fixed p, the 
condition number of M (<£, Y) is bounded with sufficiently high probability. Explicit derivations 
of such bounds and conditions is subject for future research. 

Sparse models based on random sample sets. A natural question that arises in our 
context is whether we can build accurate, i.e., fully quadratic models, without requiring (n + 
l)(ra + 2)/2 sample points. For instance, in larger dimensional cases it often happens that the 
Hessian of the objective function is sparse. Clearly, if we know in advance that some elements 
of the Hessian (coefficients of qq) are zero, then we can reduce the number of variables in 
system (|17p . However, in a typical situation of a black-box optimization, the information about 
the sparsity of the Hessian is not available. It has been shown recently in [3] that by minimizing 
||o:q||i instead of ||aQ||2 it is possible to recover fully quadratic interpolation models of a function 
with sparse Hessian by using fewer sample points than (n + l)(n + 2)/2. This is the first result 
that shows that fully quadratic model recovery with incomplete sample sets is possible. This 
result relies on the theory of sparse recovery in compressed sensing [6] and on results in random 
matrix theory [3D]. In particular these models are shown to be fully quadratic with probability 
larger than 1 — re~ 7 gp , for some universal constant 7 > 0, as long as the number of sample 
points satisfies p > O (n(logra) 4 ) and a sparse fully quadratic model exists. 

Similar, but much simpler results can be obtained for recovery of a sparse fully linear model, 
if such a model exists. In this case, the sample set Y can be generated by a Gaussian distribution 
around the current iterate and the random matrix M($, Y) can be viewed as a Gaussian matrix, 
just as it described above. Sparse signal recovery can be applied in this well-known case to show 
that if the number of nonzeros in the gradient is s and the number of sample points is 

p > Cs\og(n/s), 

then a sparse fully linear model can be recovered with probability greater than 1 — cie _C2P , for 
some universal constants c±, C2, and C. In fact the constants also depend on the error between 
the function values f{yi) and the sparse model values m(yi), but we omit these details here for 
simplicity. 

Nonuniform recovery and martingale property. In the examples we considered so far 
the sample sets are generated to provide high quality of the models independently of the past 
history of the algorithm. However, our theory allows the probability of a good model to be 
dependent on the past. In some cases taking this into account may provide a more efficient 
approach to building models. Here we discuss one possible example. 

The results of recovery of sparse models which we considered so far from compressed sensing 
imply, the so-called, uniform recovery, where the matrix M($,Y) is designed in such a way 
that any sparse model can be recovered. However, in our case, it is sufficient to recover the 
specific model that happens to approximate the objective function / sufficiently well in a trust 
region. Thus, the nonuniform recovery results can apply. Some of these results, including the 
ones for the Gaussian matrices, can be found in [2] [30]. The key is that if only one fixed signal 
needs to be recovered with high probability, then it is sufficient to generate the random matrix 
M($, Y) using fewer samples than what is necessary for the uniform recovery. The probability 
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Figure 3: RSTR: 
n/=346, /=10" 5 , 



it=l8, n/=494, /=10- n , GSTR: »t=164, n/=185, /=10~ 8 , MFN: it=325, 



of generating a fully linear or fully quadratic model can be made sufficiently high, conditioned 
on the model itself. This fact, in our setting, means that the probability of a "good" model 
is high conditioned on the current iterate and trust-region radius, in other words, on the past 
behavior of the algorithm. In short, we observe that such a setting will satisfy the submartingale 
property, but not complete independence on the past. 

Example of comparing performance of sparse model recovery vs. other underdeter- 
mined second order models. Consider the following function again, f(x) = 10(x2 — x\) 2 + 
(1 — xi) 2 , but this time x G M 10 , which means that we have a 10-dimensional problem, but 
only the first two dimensions are important. Note that to build a fully linear model without 
applying sparse recovery we need to sample 11 points and to build a fully quadratic model we 
need 66 sample points. We apply three variants of the trust-region algorithm [3] to this problem 
which only differ by the choice of the models. In the first case the models are built based on 26 
random points that are distributed in a small hypercube around the current iterate (a range of 
points from 20 to 30 was tried with similar results), we call this method RSTR. In the second 
case we build sparse models based on "greedy" sample sets of up to 31 points, which only use 
points generated in the course of the trust-region steps, in other words, reusing old points, we 
call it GSTR. The third algorithm uses the same greedy sample sets, but constructs MFN mod- 
els, rather than sparse models, we call this method MFN. The resulting optimization paths are 
illustrated in Figure and the final outcome is as follows: 

1. RSTR: Number of iterations: 18, number of function evaluations: 494, final function value: 
4.0e-ll. 



GSTR: Number of iterations: 
value: 5.0e-8. 



164, number of function evaluations: 185, final function 



3. MFN: Number of iterations: 325, number of function evaluations: 346, final function value: 
2.5e-5. 

This example illustrates that the RSTR clearly recovers the fully quadratic models of /, 
while the other two methods do not. This is evident from the number of iterations required 
by each algorithm. While the first algorithm performs more function evaluations, they can be 
obtained in parallel, and the achieved accuracy is by far better than that of the other methods. 



26 



Other random models. Additional settings where relying on random models may give an 
advantage for an optimization scheme occur in a parallel environment when full synchronization 
is not needed. In other words, if function evaluations are obtained in parallel for a collection 
of sample points, some of the function evaluations may take much longer than others. In that 
case it is possible to compute a model based on a sufficiently large subset of sampled values and 
ignore the points whose function values are not returned on time. Under the assumption that 
the function computation failures occur randomly, the remaining subset is still a random sample 
set. 

Alternatively, one may consider a setting where the objective function is evaluated approxi- 
mately for each sample point, with some high probability of this approximation being accurate, 
but yet some small probability of a bad approximation. In this case the resulting interpola- 
tion/regression model will provide a good approximation with high probability. Note that when 
computing the function value at the potential new iterate (rather than a sample point) one is 
still assuming that an accurate value is computed. Relaxing this condition is also a subject for 
future study. 

Reusing sample points. In a sequential computational setting with expensive function eval- 
uations it is efficient to reuse existing sample points in the vicinity of the current iteration. 
The success of the second method in the example above indicates that sparse models based on 
greedy sample sets are useful, even though the sparse recovery properties are unlikely to hold for 
such sets. Hence the random sample models may be dependent in some practical approaches. 
Investigating the case when the submartingale property holds for such sample sets, relaxing the 
submartingale property in a controlled way, and deriving new convergence results is a subject 
of our future research. 
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